In the first assignment, we utilized the RNASeq data from Sanghi et al.’s publication “Chromatin Accessibility Associates with Protein-RNA Correlation in Human Cancer”(Sanghi, Gruber, and Metwally (2021)) to perform data cleaning, normalization, gene mapping to HUGO symbols, and preliminary analysis. The goal of the original study was to explore the relationship between chromatin structure alterations and molecular phenotypes in cancer by utilizing multi-omics profiling of human tumors. They applied this approach to a total of 36 individuals to obtain thyroid cancer primary tumors, metastases, and patient-matched normal tissue, with a total of 87 samples. The study identified a local chromatin structure that is highly correlated with coordinated RNA and protein expression, particularly within gene-body enhancers, and claimed that local enhancers may be more important for regulating cancer gene expression than distal enhancers. Moreover, the authors found that TFs in the MAPK pathway are actively bound significantly more in tumor and metastases than in normal tissue.
We obtained the dataset with the ID GSE162515 from GEO, which is linked to the study “Chromatin Accessibility Associates with Protein-RNA Correlation in Human Cancer”. The dataset contains a total of 28,883 genes, and the experiment conditions are categorized into 27 normal tissue samples, 30 tumor tissue samples, and 30 metastases tissue samples.
To improve the data quality, we removed genes with less than 1 count per million (cpm) in less than three samples, resulting in the removal of 11,538 genes. We then performed normalization using TMM with the edgeR package to correct the large deviation of means between the tumor, metastases and normal tissue groups while still preserving some of the original sample distribution. Interestingly, the data was already well-aligned after removing low counts, so the normalization step barely improved the quality of the dataset. Figures such as MDS plot and BCV plot are generated to visualize the quality, as shown below.
From the post-normalization MDS plot, we observed that the overall separation between each test condition group (T and M) and the normal group is clear, indicating a good dataset quality.
Additionally, the variance of the data was relatively consistent with the expected trend, as indicated by the dispersion-squared BCV plot.
To map identifiers, we utilized the package biomaRt. Fortunately, gene IDs in the original dataset were mostly already mapped to the corresponding HUGO gene symbols. After removing low counts, genes with duplicate identifiers, and genes that cannot be mapped to HUGO symbols, the final dataset contained 17,368 unique genes.
In this section, we import and install the necessary packages for this assignment, in which we will conduct a differential expression analysis using the normalized dataset and a thresholded over-representation analysis.
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")}
if (!requireNamespace("GEOmetadb", quietly = TRUE)){
BiocManager::install("GEOmetadb")}
if (!requireNamespace("circlize", quietly = TRUE))
install.packages("circlize")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
if (!requireNamespace("gprofiler2", quietly = TRUE))
BiocManager::install("gprofiler2")
if (!requireNamespace("ggplot2", quietly = TRUE)){
install.packages("ggplot2")}
if (!requireNamespace("edgeR", quietly = TRUE)){
BiocManager::install("edgeR")}
if (!requireNamespace("biomaRt", quietly = TRUE)){
BiocManager::install("biomaRt")}
if (!requireNamespace("knitr", quietly = TRUE)){
install.packages("knitr")}
if (!requireNamespace("GEOquery", quietly = TRUE)){
BiocManager::install("GEOquery")}
if (!requireNamespace("Biobase", quietly = TRUE)){
BiocManager::install("Biobase")}
if (!requireNamespace("dplyr", quietly = TRUE)){
install.packages("dplyr")}
if (!requireNamespace("kableExtra", quietly = TRUE)){
install.packages("kableExtra")}
Load packages
library("GEOmetadb")
library("ggplot2")
library("edgeR")
library("biomaRt")
library("ComplexHeatmap")
library("circlize")
library("dplyr")
library("GEOquery")
library("Biobase")
library("knitr")
library("kableExtra")
In this section, we will use edgeR package to perform differential expression analysis.
The normalized data processed from Assignment 1 has been stored in the file named as “normalized_counts_final.txt”. We now load this data into R. The categories for the samples were saved in “samples.txt” file. We can also load it into R.
normalized_counts <- read.table("normalized_counts_final.txt")
samples <- read.table("samples.txt")
The format of the normalized count dataset has already been processed in a way that can be directly used to plot the heatmap, which would be our next step.
kable(normalized_counts[1:5,1:5], format = "html")
| F001.C1.T | F002.C1.N | F003.C1.M | F004.C2.T | F005.C2.N | |
|---|---|---|---|---|---|
| A1BG-AS1 | 9.941028 | 4.706365 | 11.072142 | 5.9720957 | 3.615496 |
| A2M | 639.332343 | 806.127267 | 728.618408 | 565.2189214 | 515.208115 |
| A2M-AS1 | 0.379692 | 2.312610 | 1.428664 | 0.6846989 | 2.824606 |
| A4GALT | 6.282177 | 14.200239 | 16.310575 | 11.2214538 | 9.151723 |
| AAAS | 29.512426 | 30.713088 | 28.751854 | 34.8055261 | 24.178626 |
There are two columns, “individuals” and “tissue type”. The “individual” column stores data of the patients who the authors took the samples from, and the “tissue type” indicates the type of the tissue, including tumors, metastases, and normal tissue.
knitr::kable(head(samples, 10), format = "html")
| individual | tissue_type | |
|---|---|---|
| F001.C1.T | C1 | T |
| F002.C1.N | C1 | N |
| F003.C1.M | C1 | M |
| F004.C2.T | C2 | T |
| F005.C2.N | C2 | N |
| F007.C3.T | C3 | T |
| F009.C4.T | C4 | T |
| F010.C4.N | C4 | N |
| F014.C5.N | C5 | N |
| F017.C7.T | C7 | T |
In this section, we will create a data matrix from our dataset.
In order to perform statistical testing, we need a design matrix that defines our model. Notice that in our dataset, there are two factors:
Hence, ideally, we would like to account for both factors in our design matrix.
Recall that we want to find the genes that are differentially expressed in the tumor and metastases samples in contrast to normal tissues, so we first factor the tissue types such that “N” type becomes the baseline/reference level. This would affect on which tissue type would be chosen as the control value (i.e. intercept) when performing model.matrix function to generate the design matrix.
# Set normal tissue type as the intercept
samples$tissue_type <- factor(samples$tissue_type)
samples$tissue_type <- relevel(samples$tissue_type, ref="N")
# Doesn't really matter for individual people, but we can set C1 as the intercept for the sake of tidiness.
samples$individual <- factor(samples$individual)
samples$individual <- relevel(samples$individual, ref="C1")
# Generate design matrix
model_design <- model.matrix(~ samples$individual + samples$tissue_type)
model_design[1:5,1:5]
(Intercept) samples$individualC10 samples$individualC11 samples$individualC12 samples$individualC13
1 1 0 0 0 0
2 1 0 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
For our downstream analysis, we are going to use edgeR, which is specifically designed for RNASeq data. First, we create the base edgeR object called DGEList. The group we want to define is the tissue type.
d <- edgeR::DGEList(counts = normalized_counts, group = samples$tissue_type)
For further processing, we choose to use the quasi-likelihood models since our dataset is from an RNASeq experiment and quasi-likelihood models are best suited to handle RNASeq data.
One important underlying assumption for using the Quasi-likelihood model is that the data follows a negative binomial distribution. We need to verify that our dataset indeed meets that assumption.
To do this, we calculate the dispersion and generate the plot to visualize the mean-variance relationship.
d <- edgeR::estimateDisp(d, model_design)
# Generate MV plot
edgeR::plotMeanVar(d,
show.raw.vars = TRUE,
show.tagwise.vars = TRUE,
NBline = TRUE,
show.ave.raw.vars = TRUE,
show.binned.common.disp.vars = TRUE,
main = "Mean-Variance Plot for Normalized Data")
# Display legend
legend("topleft",
legend=c("Raw Data", "Tagwise Dispersion", "Average Raw Variances",
"Binned Common Dispersion", "Negative Binomial Line"),
col = c("grey", "lightblue", "maroon", "red", "dodgerblue2"), pch=c(1,1,4,4,NA), lty=c(0,0,0,0,1), lwd=c(1,1,1,1,2), cex=0.6)
From the MV plot, we can see that our normalized count data follows the negative binomial distribution, where it clearly aligns with the blue line indicating the negative binomial trend.
Now, we have created the design matrix and verified the assumption for the data to be negative-binomially distributed, we can proceed to the next stage of our analysis. We fit the model using our design matrix:
fit <- edgeR::glmQLFit(d, model_design)
Once we have fit the model, we can proceed to calculate differential expression. We will perform the calculation separately for Tumor vs. Normal, and Metastases vs. Normal. To ensure that the significantly differentially expressed genes are not obtained by random, we will perform correction for multiple hypothesis testing using Benjamini-Hochberg approach.
In this section, we want to test for differential expression between the tumor samples and normal samples.
qlf_tn <- edgeR::glmQLFTest(fit, coef='samples$tissue_typeT')
qlf_tn_hits <- edgeR::topTags(qlf_tn,sort.by = "PValue", adjust.method = "BH",
n = nrow(normalized_counts))
knitr::kable(head(qlf_tn_hits$table), format = "html")
| logFC | logCPM | F | PValue | FDR | |
|---|---|---|---|---|---|
| CLDN16 | 5.384950 | 5.772560 | 122.90420 | 0 | 0 |
| LHX2 | 3.473253 | 1.701027 | 130.64741 | 0 | 0 |
| HS6ST2 | 4.133289 | 4.429133 | 110.57323 | 0 | 0 |
| PRR15 | 5.114034 | 5.685377 | 107.41459 | 0 | 0 |
| SLIT1 | 4.810897 | 5.765108 | 100.52563 | 0 | 0 |
| ABCC11 | 3.961639 | 2.420472 | 99.50698 | 0 | 0 |
NA
We can examine the number of genes pass the threshold and correction.
We are using 0.05 as the threshold for p-value as it is commonly used in
practice.
Number of genes that passed the threshold:
length(which(qlf_tn_hits$table$PValue < 0.05))
[1] 7145
Number of genes that passed correction:
length(which(qlf_tn_hits$table$FDR < 0.05))
[1] 5648
The threshold of 0.05 gives us quite a lot of genes. However, since
we are dealing with human disease, which should be more strict to obtain
more meaningful hits for further analysis, we choose to use a more
stringent threshold of 0.01.
Number of genes that passed the 0.01 p value threshold:
length(which(qlf_tn_hits$table$PValue < 0.01))
[1] 5156
Number of genes that passed correction of 0.01 threshold:
length(which(qlf_tn_hits$table$FDR < 0.01))
[1] 3893
There are still a lot of genes left for further analysis. In
later sections, genes that have differentially expressed for less than 2
log fold change (LFC) would also be filtered out to obtain a set of more
meaningful data.
In this section, we test for differential expression between the metastasis tissues and the normal tissues.
qlf_mn <- edgeR::glmQLFTest(fit, coef='samples$tissue_typeM')
qlf_mn_hits <- edgeR::topTags(qlf_mn,sort.by = "PValue", adjust.method = "BH",
n = nrow(normalized_counts))
knitr::kable(head(qlf_mn_hits$table), format = "html")
| logFC | logCPM | F | PValue | FDR | |
|---|---|---|---|---|---|
| CDH16 | -6.791713 | 5.728732 | 195.1855 | 0 | 0 |
| CWH43 | -4.967247 | 3.916405 | 187.8185 | 0 | 0 |
| CHGA | -4.588762 | 1.320610 | 647.5490 | 0 | 0 |
| CLCNKB | -3.512404 | 3.970571 | 170.1807 | 0 | 0 |
| CCDC146 | -2.524562 | 4.409113 | 161.5505 | 0 | 0 |
| APOA1 | -4.051688 | 2.087150 | 160.0036 | 0 | 0 |
NA
We apply the 0.01 threshold for the same reason as indicated
previously.
Number of genes that passed the 0.01 p value threshold:
length(which(qlf_mn_hits$table$PValue < 0.01))
[1] 6590
Number of genes that passed correction with 0.01 threshold:
length(which(qlf_mn_hits$table$FDR < 0.01))
[1] 5477
Again, we would also filter out the genes that have differentially expressed less than 2 LFC in later sections.
We plot the volcano plot to visualize differential expression of genes for tumor vs. normal tissue samples. Each gene is represented by a point in the plot. The horizontal axis of the plot is the log2 fold change and the vertical axis is the -log10p which indicates how likely the differential of a gene is due to actual biological variation.
# Make all data to grey first
volcano_color_tn = rep('gray', times = nrow(qlf_tn_hits$table))
# For those down-regulated genes with more than 2 LFC and passes 0.01 FDR
# correction threshold, make it blue
volcano_color_tn[qlf_tn_hits$table$logFC < 0 &
qlf_tn_hits$table$FDR < 0.01
& abs(qlf_tn_hits$table$logFC) > 2] <- 'blue'
# For up-regulated genes with more than 2LFC and passes 0.01 FDR
# correction threshold, make it red
volcano_color_tn[qlf_tn_hits$table$logFC > 0 &
qlf_tn_hits$table$FDR < 0.01 &
abs(qlf_tn_hits$table$logFC) > 2] <- 'red'
# Plot the volcano plot
plot(qlf_tn_hits$table$logFC,
-log(qlf_tn_hits$table$PValue, base=10),
col = volcano_color_tn,
xlab = "log2 fold change",
ylab = "-log10 p",
main = "Volcano plot for Tumor vs. Normal Tissues"
)
# Add the legend
legend("topleft", legend=c("Downregulated in tumor tissues","Upregulated in tumor tissues", "Not significant"),
fill = c("blue", "red", "grey"), cex = 0.5)
# Label genes with over 4 LFC
tn_sig <- which(qlf_tn_hits$table$logFC > 5 | qlf_tn_hits$table$logFC < -4)
text(x = qlf_tn_hits$table$logFC[tn_sig] , y = -log10(qlf_tn_hits$table$PValue[tn_sig]),
label = rownames(qlf_tn_hits$table)[tn_sig], cex = 0.5, adj = c(1, NA), pos = 3)
There is extensive number of upregulated genes in tumor tissues
compared to normal tissues, with the highest of exceeding 6 LFC (log2
Fold Change). Several genes are significantly downregulated, having more
than 4 LFC.
We also plot the volcano plot to visualize differential expression of genes for metastases vs. normal tissue samples.
# Make all data to grey first
volcano_color_mn = rep('gray', times = nrow(qlf_mn_hits$table))
# For those down-regulated genes with more than 2 LFC and passes 0.01 FDR
# correction threshold, make it blue
volcano_color_mn[qlf_mn_hits$table$logFC < 0 &
qlf_mn_hits$table$FDR < 0.01
& abs(qlf_mn_hits$table$logFC) > 2] <- 'blue'
# For up-regulated genes with more than 2LFC and passes 0.01 FDR
# correction threshold, make it red
volcano_color_mn[qlf_mn_hits$table$logFC > 0 &
qlf_mn_hits$table$FDR < 0.01 &
abs(qlf_mn_hits$table$logFC) > 2] <- 'red'
# Plot the volcano plot
plot(qlf_mn_hits$table$logFC,
-log(qlf_mn_hits$table$PValue, base=10),
col = volcano_color_mn,
xlab = "log2 fold change",
ylab = "-log10 p",
main = "Volcano plot for Metastases vs. Normal Tissues"
)
# Add the legend
legend("topright", legend=c("Downregulated in tumor tissues","Upregulated in tumor tissues", "Not significant"),fill = c("blue", "red", "grey"), cex = 0.5)
# Label genes with over 5 LFC
mn_sig <- which(qlf_mn_hits$table$logFC > 5 | qlf_mn_hits$table$logFC < -5)
text(x = qlf_mn_hits$table$logFC[mn_sig] , y = -log10(qlf_mn_hits$table$PValue[mn_sig]), label = rownames(qlf_mn_hits$table)[mn_sig], cex = 0.5, adj = c(1, NA), pos = 3)
There is extensive number of upregulated and downregulated genes in metastases tissues compared to normal tissues, with the highest of exceeding 5 LFC.
In order to find the most significantly differentially expressed, we pick the hits that pass a P-value threshold of 0.01, and has absolute value of LFC larger than 2 as the top hits.
# Get top hit genes for T vs N
top_hits_tn <- rownames(qlf_tn_hits)[qlf_tn_hits$table$PValue < 0.01 & abs(qlf_tn_hits$table$logFC) > 2]
Then, we calculate their log2 counts per million value. We add 1 to the value to eliminate mathematical error when the count is 0. This technique is commonly used in practise when calculating log values.
# Calculate logCPM
hm_matrix <- log2(normalized_counts + 1) # Plus 1 to eliminate error when the count is 0.
Obtain data for each category
# Obtain tumor and normal tissue samples
tumor <- grep("T$", colnames(hm_matrix), value=TRUE)
normal <- grep("N$", colnames(hm_matrix), value=TRUE)
metastases <- grep("M$", colnames(hm_matrix), value=TRUE)
Scale heat map matrix by rows.
# Scale heat map matrix by rows
hm_matrix_tn <- t(scale(t(hm_matrix[rownames(hm_matrix) %in% top_hits_tn, c(tumor, normal)])))
Pick the color to be used for the plot. If heatmap only contains non-negative values, we will use only white and red, where white indicates 0, and red indicates positive values.
Otherwise, we will use blue to represent negative values.
We plot the heatmap to visualize differential expression of genes for metastases vs. normal tissue samples.
# Choose colors to be used
if(min(hm_matrix_tn) == 0){
heatmap_color = colorRamp2(c( 0, max(hm_matrix_tn)),
c( "white", "red"))
} else {
heatmap_color = colorRamp2(c(min(hm_matrix_tn), 0,
max(hm_matrix_tn)), c("blue", "white", "red"))
}
# Create the heatmap from the given matrix, showing the dendrogram for both genes and samples.
name = "Expression level"
hm_tn <- Heatmap(as.matrix(hm_matrix_tn),
show_row_dend = TRUE, show_column_dend = TRUE,
col = heatmap_color, show_column_names = TRUE,
show_row_names = FALSE, show_heatmap_legend = TRUE,
column_names_gp = grid::gpar(fontsize = 6),
heatmap_legend_param = list(title = name))
draw(hm_tn,
column_title="Tumor vs. Normal Tissue Heatmap for Top DE Genes",
column_title_gp=grid::gpar(fontsize=12))
Now we can also plot heatmap for Metastases vs. Normal comparison, with same steps as above.
# Get top hit genes for M vs N
top_hits_mn <- rownames(qlf_mn_hits)[qlf_mn_hits$table$PValue < 0.01 & abs(qlf_mn_hits$table$logFC) > 2]
# Scale heat map matrix by rows
hm_matrix_mn <- t(scale(t(hm_matrix[rownames(hm_matrix) %in% top_hits_mn, c(metastases, normal)])))
# Choose colors to be used
if(min(hm_matrix_mn) == 0){
heatmap_color = colorRamp2(c( 0, max(hm_matrix_mn)),
c( "white", "red"))
} else {
heatmap_color = colorRamp2(c(min(hm_matrix_mn), 0,
max(hm_matrix_mn)), c("blue", "white", "red"))
}
# Create the heatmap from the given matrix, showing the dendrogram for both genes and samples.
hm_mn <- Heatmap(as.matrix(hm_matrix_mn),
show_row_dend = TRUE, show_column_dend = TRUE,
col = heatmap_color, show_column_names = TRUE,
show_row_names = FALSE, show_heatmap_legend = TRUE,
column_names_gp = grid::gpar(fontsize = 6),
heatmap_legend_param = list(title = name))
draw(hm_mn,
column_title="Metastases vs Normal Tissue Heatmap for Top DE Genes",
column_title_gp=grid::gpar(fontsize=12))
At first, I employed the commonly used p-value of 0.05, resulting in 7145 genes for the tumor vs. normal tissue analysis, and 8634 genes for the metastases vs. normal tissue analysis, before multiple hypothesis testing corrections. However, this was an extensive set of genes, prompting us to alter the p-value threshold to 0.01, and FDR threshold also to 0.01, to restrict the number of genes to be included. Additionally, we imposed a criterion that a gene should have an absolute log2 fold change of at least 2. This ensured that we only obtained genes with significant differential expression, which we believed to be meaningful and essential for further analysis.
The two primary approaches for controlling false discovery rate are Bonferroni and Benjamini-Hochberg corrections, and we employed the Benjamini-Hochberg method for multiple hypothesis testing correction. Our objective was to identify significant hits without omitting meaningful ones. Bonferroni method is useful when the number of tests is small and when the tests are independent of each other, but it becomes overly stringent and impractical when the number of tests is large or when the tests are correlated. Since we are dealing with a large dataset with over 80 samples, Bonferroni is not desirable for our purpose. Therefore, we opted for Benjamini-Hochberg correction, which provided a more comprehensive set of genes for downstream analysis. Following this correction, we found that the tumor vs. normal tissue analysis yielded 3893 genes, and the metastases vs. normal tissue analysis yielded 5477 genes that passed the correction.
Volcano plots for both Tumor vs. Normal and Metastases vs. Normal datasets are shown above. The most significantly differentially expressed genes are labeled out by their HUGO symbols in the figures.
There are clear clusterings within conditions, as shown in the graph represented in red and blue. This suggests that the tumor tissues and metastases tissues do have genes that are highly differentially expressed compared to normal tissues.
For the final part of this assignment, we will perform a thresholded overrepresentation analysis using g:Profiler. In the previous section, we have compiled a list of differentially expressed genes. Here, we want to further divide them into upregulated and downregulated genes.
First, we obtain the upregulated and downregulated genes from the tumor vs. normal tissue dataset.
upregulated_tn <- qlf_tn_hits[qlf_tn_hits$table$logFC > 0 & qlf_tn_hits$table$PValue < 0.01, ]
downregulated_tn <- qlf_tn_hits[qlf_tn_hits$table$logFC < 0 & qlf_tn_hits$table$PValue < 0.01, ]
We use the R package for g:Profiler to perform the gene enrichment analysis. For correction, we used FDR as it is less stringent than Bonferroni and is introduced to be the preferred correction method in class. We used GO Biological Process, GO Molecular Function, and WP as those are the ones used previously in Journal entry assignments, which we are more familiar with.
Then, we perform analysis separately for up-regulated genes and down-regulated genes.
To obtain the terms these genes are involved in, we use the gost function from gProfiler2 package.
tn_up_top_terms_all <- gprofiler2::gost(query = rownames(upregulated_tn),
organism = "hsapiens",
exclude_iea = TRUE,
correction_method = "fdr",
sources = c("GO:BP", "REAC", "WP"))
tn_up_top_terms <- data.frame(
term_name = tn_up_top_terms_all$result$term_name[tn_up_top_terms_all$result$term_size < 500 &
tn_up_top_terms_all$result$term_size > 1],
term_id = tn_up_top_terms_all$result$term_id[tn_up_top_terms_all$result$term_size < 500 &
tn_up_top_terms_all$result$term_size > 1],
source = tn_up_top_terms_all$result$source[tn_up_top_terms_all$result$term_size < 500 &
tn_up_top_terms_all$result$term_size > 1]
)
knitr::kable(head(tn_up_top_terms, 10), format = "html")
| term_name | term_id | source |
|---|---|---|
| external encapsulating structure organization | GO:0045229 | GO:BP |
| extracellular matrix organization | GO:0030198 | GO:BP |
| extracellular structure organization | GO:0043062 | GO:BP |
| cell projection morphogenesis | GO:0048858 | GO:BP |
| cell part morphogenesis | GO:0032990 | GO:BP |
| plasma membrane bounded cell projection morphogenesis | GO:0120039 | GO:BP |
| response to wounding | GO:0009611 | GO:BP |
| neuron projection morphogenesis | GO:0048812 | GO:BP |
| cell junction assembly | GO:0034329 | GO:BP |
| chemotaxis | GO:0006935 | GO:BP |
For context, let’s examine the top term from each data source.
knitr::kable(rbind(tn_up_top_terms[tn_up_top_terms$source == "GO:BP",][1,],
tn_up_top_terms[tn_up_top_terms$source == "REAC",][1,],
tn_up_top_terms[tn_up_top_terms$source == "WP",][1,]),
format = "html")
| term_name | term_id | source | |
|---|---|---|---|
| 1 | external encapsulating structure organization | GO:0045229 | GO:BP |
| 416 | Extracellular matrix organization | REAC:R-HSA-1474244 | REAC |
| 453 | Malignant pleural mesothelioma | WP:WP5087 | WP |
Generate a Manhattan plot to visualize the distribution of the top terms from each data source.
gprofiler2::gostplot(tn_up_top_terms_all) %>% plotly::layout(title = "Manhattan plot for Upregulated genes (Tumor vs. Normal)", font = list(size = 10))
Number of terms:
length(tn_up_top_terms$term_name)
[1] 472
We do the same for the downregualted genes.
tn_down_top_terms_all <- gprofiler2::gost(query = rownames(downregulated_tn),
organism = "hsapiens",
exclude_iea = TRUE,
correction_method = "fdr",
sources = c("GO:BP", "REAC", "WP"))
tn_down_top_terms <- data.frame(
term_name = tn_down_top_terms_all$result$term_name[tn_down_top_terms_all$result$term_size < 500 &
tn_down_top_terms_all$result$term_size > 1],
term_id = tn_down_top_terms_all$result$term_id[tn_down_top_terms_all$result$term_size < 500 &
tn_down_top_terms_all$result$term_size > 1],
source = tn_down_top_terms_all$result$source[tn_down_top_terms_all$result$term_size < 500 &
tn_down_top_terms_all$result$term_size > 1]
)
knitr::kable(head(tn_down_top_terms, 10),format = "html")
| term_name | term_id | source |
|---|---|---|
| aerobic respiration | GO:0009060 | GO:BP |
| cellular respiration | GO:0045333 | GO:BP |
| oxidative phosphorylation | GO:0006119 | GO:BP |
| electron transport chain | GO:0022900 | GO:BP |
| proton motive force-driven mitochondrial ATP synthesis | GO:0042776 | GO:BP |
| generation of precursor metabolites and energy | GO:0006091 | GO:BP |
| proton motive force-driven ATP synthesis | GO:0015986 | GO:BP |
| energy derivation by oxidation of organic compounds | GO:0015980 | GO:BP |
| respiratory electron transport chain | GO:0022904 | GO:BP |
| ATP synthesis coupled electron transport | GO:0042773 | GO:BP |
Top terms from the data sources:
knitr::kable(rbind(tn_down_top_terms[tn_down_top_terms$source == "GO:BP",][1,],
tn_down_top_terms[tn_down_top_terms$source == "REAC",][1,],
tn_down_top_terms[tn_down_top_terms$source == "WP",][1,]),
format = "html")
| term_name | term_id | source | |
|---|---|---|---|
| 1 | aerobic respiration | GO:0009060 | GO:BP |
| 97 | The citric acid (TCA) cycle and respiratory electron transport | REAC:R-HSA-1428517 | REAC |
| 116 | Electron transport chain: OXPHOS system in mitochondria | WP:WP111 | WP |
Plot the Manhattan plot to visualize distribution of terms from each data source for downregulated genes.
gprofiler2::gostplot(tn_down_top_terms_all) %>%
plotly::layout(title = "Manhattan plot for Downregulated genes (Tumor vs. Normal)", font = list(size = 10))
Number of terms
length(tn_down_top_terms$term_name)
[1] 126
Finally, we analyze for all differentially expressed genes (i.e. both up-regulated and down-regulated genes as a whole).
tn_top_terms_all <- gprofiler2::gost(query = rownames(qlf_tn_hits),
organism = "hsapiens",
exclude_iea = TRUE,
correction_method = "fdr",
sources = c("GO:BP", "REAC", "WP"))
tn_top_terms <- data.frame(
term_name = tn_top_terms_all$result$term_name[tn_top_terms_all$result$term_size < 500 &
tn_top_terms_all$result$term_size > 1],
term_id = tn_top_terms_all$result$term_id[tn_top_terms_all$result$term_size < 500 &
tn_top_terms_all$result$term_size > 1],
source = tn_top_terms_all$result$source[tn_top_terms_all$result$term_size < 500 &
tn_top_terms_all$result$term_size > 1]
)
knitr::kable(head(tn_top_terms, 10), format = "html")
| term_name | term_id | source |
|---|---|---|
| positive regulation of cell migration | GO:0030335 | GO:BP |
| ribonucleoprotein complex biogenesis | GO:0022613 | GO:BP |
| autophagy | GO:0006914 | GO:BP |
| process utilizing autophagic mechanism | GO:0061919 | GO:BP |
| positive regulation of cell motility | GO:2000147 | GO:BP |
| positive regulation of locomotion | GO:0040017 | GO:BP |
| ribosome biogenesis | GO:0042254 | GO:BP |
| vasculature development | GO:0001944 | GO:BP |
| regulation of amide metabolic process | GO:0034248 | GO:BP |
| regulation of mitotic cell cycle | GO:0007346 | GO:BP |
Top terms from the data sources:
knitr::kable(rbind(tn_top_terms[tn_top_terms$source == "GO:BP",][1,],
tn_top_terms[tn_top_terms$source == "REAC",][1,],
tn_top_terms[tn_top_terms$source == "WP",][1,]),
format = "html")
| term_name | term_id | source | |
|---|---|---|---|
| 1 | positive regulation of cell migration | GO:0030335 | GO:BP |
| 896 | RHO GTPase cycle | REAC:R-HSA-9012999 | REAC |
| 1316 | VEGFA-VEGFR2 signaling pathway | WP:WP3888 | WP |
gprofiler2::gostplot(tn_top_terms_all) %>% plotly::layout(title = "Manhattan plot for All DE genes (Tumor vs. Normal)", font = list(size = 10))
There a total of 1447 terms:
length(tn_top_terms$term_name)
[1] 1447
Again, we perform the analysis for Metastases vs. Normal tissue comparison. First, we obtain the up-regulated genes and down-regulated genes separately.
upregulated_mn <- qlf_mn_hits[qlf_mn_hits$table$logFC > 0 & qlf_mn_hits$table$PValue < 0.01, ]
downregulated_mn <- qlf_mn_hits[qlf_mn_hits$table$logFC < 0 & qlf_mn_hits$table$PValue < 0.01, ]
Obtain all terms involved for up-regulated genes using GO:BP, REAC and WP data source.
mn_up_top_terms_all <- gprofiler2::gost(query = rownames(upregulated_mn),
organism = "hsapiens",
exclude_iea = TRUE,
correction_method = "fdr",
sources = c("GO:BP", "REAC", "WP"))
mn_up_top_terms <- data.frame(
term_name = mn_up_top_terms_all$result$term_name[mn_up_top_terms_all$result$term_size < 500 &
mn_up_top_terms_all$result$term_size > 1],
term_id = mn_up_top_terms_all$result$term_id[mn_up_top_terms_all$result$term_size < 500 &
mn_up_top_terms_all$result$term_size > 1],
source = mn_up_top_terms_all$result$source[mn_up_top_terms_all$result$term_size < 500 &
mn_up_top_terms_all$result$term_size > 1]
)
knitr::kable(head(mn_up_top_terms, 10), format = "html")
| term_name | term_id | source |
|---|---|---|
| T cell activation | GO:0042110 | GO:BP |
| leukocyte cell-cell adhesion | GO:0007159 | GO:BP |
| regulation of leukocyte cell-cell adhesion | GO:1903037 | GO:BP |
| regulation of T cell activation | GO:0050863 | GO:BP |
| positive regulation of leukocyte cell-cell adhesion | GO:1903039 | GO:BP |
| regulation of lymphocyte activation | GO:0051249 | GO:BP |
| regulation of cell-cell adhesion | GO:0022407 | GO:BP |
| positive regulation of cell-cell adhesion | GO:0022409 | GO:BP |
| positive regulation of T cell activation | GO:0050870 | GO:BP |
| lymphocyte proliferation | GO:0046651 | GO:BP |
For context, let’s examine the top term from each data source.
knitr::kable(rbind(mn_up_top_terms[mn_up_top_terms$source == "GO:BP",][1,],
mn_up_top_terms[mn_up_top_terms$source == "REAC",][1,],
mn_up_top_terms[mn_up_top_terms$source == "WP",][1,]),
format = "html")
| term_name | term_id | source | |
|---|---|---|---|
| 1 | T cell activation | GO:0042110 | GO:BP |
| 936 | Viral mRNA Translation | REAC:R-HSA-192823 | REAC |
| 1079 | TYROBP causal network in microglia | WP:WP3945 | WP |
We can visualize the distribution of top terms from each data source using an Manhattan plot.
gprofiler2::gostplot(mn_up_top_terms_all) %>% plotly::layout(title = "Manhattan plot for upregulated genes (Metastases vs. Normal)", font = list(size = 10))
Number of terms:
length(mn_up_top_terms$term_name)
[1] 1168
We do the same for the downregualted genes.
mn_down_top_terms_all <- gprofiler2::gost(query = rownames(downregulated_mn),
organism = "hsapiens",
exclude_iea = TRUE,
correction_method = "fdr",
sources = c("GO:BP", "REAC", "WP"))
mn_down_top_terms <- data.frame(
term_name = mn_down_top_terms_all$result$term_name[mn_down_top_terms_all$result$term_size < 500 &
mn_down_top_terms_all$result$term_size > 1],
term_id = mn_down_top_terms_all$result$term_id[mn_down_top_terms_all$result$term_size < 500 &
mn_down_top_terms_all$result$term_size > 1],
source = mn_down_top_terms_all$result$source[mn_down_top_terms_all$result$term_size < 500 &
mn_down_top_terms_all$result$term_size > 1]
)
knitr::kable(head(mn_down_top_terms, 10), format = "html")
| term_name | term_id | source |
|---|---|---|
| cilium organization | GO:0044782 | GO:BP |
| aerobic respiration | GO:0009060 | GO:BP |
| cellular respiration | GO:0045333 | GO:BP |
| cilium assembly | GO:0060271 | GO:BP |
| generation of precursor metabolites and energy | GO:0006091 | GO:BP |
| oxidative phosphorylation | GO:0006119 | GO:BP |
| plasma membrane bounded cell projection assembly | GO:0120031 | GO:BP |
| cell projection assembly | GO:0030031 | GO:BP |
| energy derivation by oxidation of organic compounds | GO:0015980 | GO:BP |
| electron transport chain | GO:0022900 | GO:BP |
knitr::kable(rbind(mn_down_top_terms[mn_down_top_terms$source == "GO:BP",][1,],
mn_down_top_terms[mn_down_top_terms$source == "REAC",][1,],
mn_down_top_terms[mn_down_top_terms$source == "WP",][1,]),
format = "html")
| term_name | term_id | source | |
|---|---|---|---|
| 1 | cilium organization | GO:0044782 | GO:BP |
| 116 | The citric acid (TCA) cycle and respiratory electron transport | REAC:R-HSA-1428517 | REAC |
| 132 | Electron transport chain: OXPHOS system in mitochondria | WP:WP111 | WP |
Plot the Manhattan plot showing distribution of terms from each data source using list of downregulated genes.
gprofiler2::gostplot(mn_down_top_terms_all) %>%
plotly::layout(title = "Manhattan plot for downregulated genes (Metastases vs. Normal)", font = list(size = 10))
Number of terms returned:
length(mn_down_top_terms$term_name)
[1] 149
Finally, for all differentially expressed genes.
mn_top_terms_all <- gprofiler2::gost(query = rownames(downregulated_mn),
organism = "hsapiens",
exclude_iea = TRUE,
correction_method = "fdr",
sources = c("GO:BP", "REAC", "WP"))
mn_top_terms <- data.frame(
term_name = mn_top_terms_all$result$term_name[mn_top_terms_all$result$term_size < 500 &
mn_top_terms_all$result$term_size > 1],
term_id = mn_top_terms_all$result$term_id[mn_top_terms_all$result$term_size < 500 &
mn_top_terms_all$result$term_size > 1],
source = mn_top_terms_all$result$source[mn_top_terms_all$result$term_size < 500 &
mn_top_terms_all$result$term_size > 1]
)
knitr::kable(head(mn_top_terms, 10), format = "html")
| term_name | term_id | source |
|---|---|---|
| cilium organization | GO:0044782 | GO:BP |
| aerobic respiration | GO:0009060 | GO:BP |
| cellular respiration | GO:0045333 | GO:BP |
| cilium assembly | GO:0060271 | GO:BP |
| generation of precursor metabolites and energy | GO:0006091 | GO:BP |
| oxidative phosphorylation | GO:0006119 | GO:BP |
| plasma membrane bounded cell projection assembly | GO:0120031 | GO:BP |
| cell projection assembly | GO:0030031 | GO:BP |
| energy derivation by oxidation of organic compounds | GO:0015980 | GO:BP |
| electron transport chain | GO:0022900 | GO:BP |
Top terms from each data source for all differentially expressed genes (Metastases vs. Normal)
knitr::kable(rbind(mn_top_terms[mn_top_terms$source == "GO:BP",][1,],
mn_top_terms[mn_top_terms$source == "REAC",][1,],
mn_top_terms[mn_top_terms$source == "WP",][1,]),
format = "html")
| term_name | term_id | source | |
|---|---|---|---|
| 1 | cilium organization | GO:0044782 | GO:BP |
| 116 | The citric acid (TCA) cycle and respiratory electron transport | REAC:R-HSA-1428517 | REAC |
| 132 | Electron transport chain: OXPHOS system in mitochondria | WP:WP111 | WP |
And again, we plot the Manhattan plot to visualize the terms enrichment:
gprofiler2::gostplot(mn_top_terms_all) %>%
plotly::layout(title = "Manhattan plot for All DE genes
(Metastases vs. Normal)", font = list(size = 10))
Number of terms:
length(mn_top_terms$term_name)
I chose g:Profiler because it was discussed in lectures, and previous Journal Entry assignment has introduced the g:Profiler to us with its usage. It provides several analysis methods and visualizations for genomic data, which meets our need. Moreover, I had previous experience using its web-interface, and after learning that it also has R package utility, this is a good chance to try it out.
I choose GO:BP, Reactome, and WikiPathways for annotation because they were previously mentioned in the Journal Assignment, which were also used on human genes, and these three are very comprehensive datasets for human pathways. The version I am using is as follows: - GO:BP releases/2022-12-04 - REAC releases/2022-12-28 - WP releases/20221210
For all three analysis (using upregulated, downregulated, all differentially expressed) for both tumor vs. normal tissue and metastases vs. normal tissue, we used a threshold of adjusted p value of 0.01, and limit term size between 1 and 500. We set the upper bound to 500 because we do not want to include overly broad and generic terms that will not give us meaningful insights into the roles of the differentially expressed genes. The P value threshold is set to 0.01.
Tumor vs. Normal:
Upregulated genes returned 490 gene sets;
Downregulated genes returned 154 genes ets; All differentially expressed
genes returned 1151 gene sets.
Metastases vs. Normal:
Upregulated genes returned 1180 gene
sets; Downregulated genes returned 162 gene sets; All differentially
expressed genes returned 162 gene sets.
Tumor vs. Normal:
Taking all DE genes together returned more
gene sets than the results for upregulated and downregulated separately,
and the results for upregulated genes are about three times more than
the downregulated genes. This suggests that the upregulated genes may be
more strongly associated with specific biological processes or pathways
than the downregulated genes. Similarly, the fact that the combined set
of DE genes returned more gene sets than either the upregulated or
downregulated genes alone suggests that there may be some shared
biological processes or pathways that are affected by both upregulated
and downregulated genes.
Metastases vs. Normal:
Upregulated genes returned around 10
times more terms than the result for all DE genes as a whole. This
indicates that the upregulated genes are enriched for certain biological
functions or pathways more strongly than the overall set of
differentially expressed genes.
Not too much, the over-representation results doesn’t strongly support the conclusions and mechanism discussed in the original paper, probably due to the annotation sources that I choose that yelds different results. However, terms such as regulation of mitotic cell cycle showed up within the analysis result for up-regulated gene sets for tumor tissues and metastases tissues. In the original paper, the authors predicted that the tumors would have TFs that interact with the MAPK pathway and regulate gene expression in a way that is relevant to the development and progression of cancer, and indeed they found that TFs in the MAPK pathway are actively bound significantly more in tumor and metastases than in normal tissue. Transcription factors are widely know to play a role in regulating gene expresison, which would also affect mitotic cell cycle, indicating that our results support the conclusions discussed in the original paper in some way.
There are publications describing how transcription factors regulates mitotic cell cycle. In the review paper “Coordinating gene expression during the cell cycle” by Martin.F et. al, they discussed the control mechanisms for mitotic cell cycles in mammals, which include activating genes with peak expression in G1/S by E2F transcription factors (TFs), which is required for DNA synthesis. (Fischer et al. (2022)).